An analysis of almost any social media data can be rather telling of how subgroups of a population interact with each other on a large scale. We are interested in the content of these interactions and how they vary across the United States over the few days that our data spans.
Related work: Anything that inspired you, such as a paper, a web site, or something we discussed in class.
What questions are you trying to answer? How did these questions evolve over the course of the project? What new questions did you consider in the course of your analysis?
Source, scraping method, cleaning, etc.
Visualizations, summaries, and exploratory statistical analyses. Justify the steps you took, and show any major changes to your ideas.
If you undertake formal statistical analyses, describe these in detail
What were your findings? Are they what you expect? What insights into the data can you make?
##                 sentiment count
## anger               anger 13605
## anticipation anticipation 52960
## disgust           disgust 12668
## fear                 fear 19942
## joy                   joy 46690
## sadness           sadness 21882
## surprise         surprise 22067
## trust               trust 76347
From this graph, we noticed that some time intervals are missing from our data set. We are not sure why; presumably the website from which we obtained the data did not scrape tweets during these times.
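Missing intervals like these can also be located without a plot. The following is a minimal Python sketch, not the code used for this report: the function name `find_gaps`, the 30-minute threshold, and the sample timestamps are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, max_gap=timedelta(minutes=30)):
    """Return (earlier, later) pairs of consecutive tweets more than max_gap apart."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > max_gap
    ]


# Hypothetical timestamps for illustration only; not taken from the dataset.
sample = [
    datetime(2000, 1, 1, 9, 0),
    datetime(2000, 1, 1, 9, 10),
    datetime(2000, 1, 1, 12, 0),
    datetime(2000, 1, 1, 9, 20),
    datetime(2000, 1, 1, 12, 5),
]

for start, end in find_gaps(sample):
    print(f"no tweets between {start} and {end}")
```

The threshold should be chosen relative to the usual tweet rate: with thousands of tweets per hour, even a gap of a few minutes would suggest that collection was interrupted.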
##           hashtags  Freq
## 1              job 51511
## 2           hiring 45428
## 3             jobs 21910
## 4        careerarc 20717
## 5           retail  7454
## 6      hospitality  7311
## 7          nursing  5091
## 8       healthcare  4702
## 9         veterans  4471
## 10           sales  3310
## 11              it  2179
## 12 customerservice  1927
## 13  transportation  1568
## 14           sonic  1520
## 15   manufacturing  1476
## 16           photo  1432
## 17    businessmgmt  1348
## 18      accounting  1053
## 19     engineering   970
## 20         traffic   955
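A frequency table like the one above can be built by pulling hashtags out of the tweet text and tallying them. The Python sketch below is illustrative only: the regular expression, the lowercasing step, and the sample tweets are assumptions, and the report's own extraction code may differ.

```python
import re
from collections import Counter

# A hashtag is taken to be '#' followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#(\w+)")


def count_hashtags(tweets):
    """Count hashtags across tweets, ignoring case (an assumption)."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts


# Made-up tweets for illustration; not rows from the dataset.
sample_tweets = [
    "We're #hiring! Click to apply: Nurse #Job #Nursing",
    "Interested in a #job near you? #Hiring #CareerArc",
    "Stuck in #traffic again",
]

hashtag_counts = count_hashtags(sample_tweets)
for tag, freq in hashtag_counts.most_common(3):
    print(tag, freq)
```

Lowercasing merges variants such as `#Job` and `#job`. Singular and plural forms, such as `job` and `jobs` in the table above, would still be counted separately.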
When mapping the positive scores for all tweets, we see low to moderate scores throughout the US. At this scale, we cannot see a definitive trend at the state level. However, we do see that few tweets were generated in the Midwest or Northwest. There also seem to be slightly more positive tweets from the middle of the country.
When mapping sentiment across the entire US, we see an overwhelming number of “trust” tweets. We are not quite sure what this emotion means.
When we filter out trust, we see that surprise and joy seem to be commonly tweeted emotions.
Because our location column varies in specificity, we built a function that takes the latitude and longitude of each tweet and converts them to the state in which the tweet originated. We then added that state to our original dataset.
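The idea behind such a coordinate-to-state lookup can be sketched in Python with a point-in-polygon test. This is not the function used in this report: the names `point_in_polygon` and `state_for` are hypothetical, and the two rectangular outlines are rough approximations of Colorado and Wyoming chosen only to keep the example self-contained. A real lookup would use proper boundary data for every state.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray running east from the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Rough rectangular approximations (assumption), as (lon, lat) corners.
STATE_POLYGONS = {
    "Colorado": [(-109.05, 37.0), (-102.05, 37.0), (-102.05, 41.0), (-109.05, 41.0)],
    "Wyoming": [(-111.05, 41.0), (-104.05, 41.0), (-104.05, 45.0), (-111.05, 45.0)],
}


def state_for(lat, lon):
    """Return the first state whose outline contains the point, else None."""
    for state, polygon in STATE_POLYGONS.items():
        if point_in_polygon(lon, lat, polygon):
            return state
    return None


print(state_for(39.74, -104.99))  # a point near Denver
print(state_for(0.0, 0.0))        # a point outside every outline
```

Returning `None` for unmatched coordinates keeps unresolved tweets distinguishable, so they can be left out of later per-state totals.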
To evaluate overall sentiment by state, we selected the appropriate columns, then grouped and summed by state, making sure to exclude tweets with missing locations. Maine, Alaska and Hawaii were not included in this survey; the count of 48 states comes from Virginia and the District of Columbia receiving individual designations.
The following heatmap shows the level of positive and negative sentiment across the United States during the 48-hour period of our dataset. Maine, Alaska and Hawaii are blacked out because tweets from those states were not recorded.
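The group-and-sum step can be sketched as follows in Python. The column names `state`, `positive`, and `negative` and the sample rows are assumptions for illustration; they are not taken from the dataset.

```python
from collections import defaultdict


def sentiment_by_state(rows):
    """Sum positive and negative scores per state, skipping missing states."""
    totals = defaultdict(lambda: {"positive": 0, "negative": 0})
    for row in rows:
        state = row.get("state")
        if not state:  # tweets whose location could not be resolved
            continue
        totals[state]["positive"] += row["positive"]
        totals[state]["negative"] += row["negative"]
    return dict(totals)


# Made-up rows for illustration only.
rows = [
    {"state": "Texas", "positive": 2, "negative": 0},
    {"state": "Texas", "positive": 1, "negative": 1},
    {"state": None, "positive": 3, "negative": 0},
    {"state": "Ohio", "positive": 0, "negative": 2},
]

state_totals = sentiment_by_state(rows)
for state, scores in sorted(state_totals.items()):
    print(state, scores["positive"], scores["negative"])
```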
We can observe from these two maps that states like California and Texas are consistently ranked highest, which is presumably related to population. Interestingly, the state with the lowest positive and negative sentiment scores is Washington. This could be for two reasons: a smaller population, or tweets from Washington containing fewer sentiment-laden words than those from other states and therefore not generating sentiment scores as strong.
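One way to tell these two explanations apart is to divide each state's summed score by its number of tweets, so that states are compared per tweet rather than in total. This was not part of the analysis above; the Python sketch below uses a hypothetical helper, `mean_score_per_tweet`, and made-up totals for two unnamed states.

```python
def mean_score_per_tweet(total_score, tweet_count):
    """Average sentiment score per tweet, or None when a state has no tweets."""
    if tweet_count == 0:
        return None
    return total_score / tweet_count


# Hypothetical totals for illustration only; not results from the dataset.
states = {
    "State A": {"total": 9000, "tweets": 6000},
    "State B": {"total": 300, "tweets": 150},
}

per_tweet = {
    name: mean_score_per_tweet(values["total"], values["tweets"])
    for name, values in states.items()
}

for name, score in per_tweet.items():
    print(name, score)
```

In this made-up example, State A has the far larger total but the lower score per tweet, which is the pattern expected when a ranking is driven by volume rather than by the wording of the tweets.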